🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions#2610
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Updates the ClusterObjectSet controller’s update-event predicate so that once a ClusterObjectSet hits ProgressDeadlineExceeded, spec-driven updates (identified via metadata.generation changes) are still reconciled while status-only updates remain suppressed—preventing resources from getting stuck and unable to transition (e.g., to Archived).
Changes:
- Adjust
skipProgressDeadlineExceededPredicateto allow updates whenObjectOld.Generation != ObjectNew.Generation. - Preserve the existing behavior of suppressing status-only update churn after
ProgressDeadlineExceeded.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2610 +/- ##
==========================================
- Coverage 68.91% 63.22% -5.70%
==========================================
Files 139 139
Lines 9863 9894 +31
==========================================
- Hits 6797 6255 -542
- Misses 2559 3142 +583
+ Partials 507 497 -10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6b78b23 to
426beb1
Compare
| func (c *ClusterObjectSetReconciler) SetupWithManager(mgr ctrl.Manager) error { | ||
| skipProgressDeadlineExceededPredicate := predicate.Funcs{ | ||
| UpdateFunc: func(e event.UpdateEvent) bool { | ||
| rev, ok := e.ObjectNew.(*ocv1.ClusterObjectSet) |
There was a problem hiding this comment.
I would extract the whole update func into standalone function, not just the logic under shouldAllowProgressDeadlineExceededUpdate
There was a problem hiding this comment.
@pedjak
I think we might can remove the predicate.
See the current proposal now.
WDYT? I tried to update the PR desc as well
426beb1 to
59eaf3d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Maybe I misunderstand the progress deadline exceeded stuff. But I thought exceeding the deadline literally only meant that we set Progressing=False? But otherwise all other functionality and decision making around reconciliation of the CE works the same as it does prior to the deadline exceeding. If that's not the case, why? |
59eaf3d to
b2da731
Compare
b2da731 to
bf9e524
Compare
|
Hi @joelanford I changed the proposal here. I think it might be a better way to solve it. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterobjectset_controller_internal_test.go
Outdated
Show resolved
Hide resolved
bf9e524 to
38e7024
Compare
pedjak
left a comment
There was a problem hiding this comment.
Three suggestions:
-
Add a comment explaining the
ReasonSucceededcarve-out. The guard usesreason != ReasonSucceeded, which means any new reason added in the future is silently blocked by default. That's the safe default, but it's a hidden constraint — a short comment explaining why onlySucceededis exempt would prevent a future contributor from accidentally breaking this. -
Add a debug log in the early return. When the guard fires, it silently swallows a state transition with zero signal. A
V(1)log line like"skipping markAsProgressing: ProgressDeadlineExceeded is sticky"would help debugging when someone is troubleshooting why a COS condition isn't updating. -
Consider an integration test for the archival scenario (follow-up). The unit tests cover
markAsProgressingwell, but the motivating bug (archival blocked after deadline exceeded) isn't tested e2e. A test simulating deadline exceeded →lifecycleState: Archived→ verify reconcile proceeds would cover the actual user-facing scenario.
d2e5ecb to
d881e80
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
592dcd2 to
7452c0b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
d840c14 to
d2dba24
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go:1
- The digest is sensitive to ordering in slices/arrays (e.g.,
Channels,MatchExpressions) even when different orderings may be semantically equivalent. This can cause unnecessary “spec changed → re-resolve” behavior. Consider canonicalizing inputs before hashing (e.g., sortChannels, and convertmetav1.LabelSelectorto a normalized string form viametav1.LabelSelectorAsSelector(...).String()or explicit sorting of requirements) to reduce unintended churn.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
2fa9f67 to
96d0dc1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
96d0dc1 to
a4ce4b3
Compare
This commit addresses the reconcile loop and archival timeout issues when ProgressDeadlineExceeded is set on a ClusterObjectSet. Generated-By: Claude
a4ce4b3 to
489155d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
internal/operator-controller/controllers/clusterobjectset_controller.go:97
- The progress-deadline enforcement currently triggers for any Progressing=True reason (anything except Succeeded). That means if reconcile hits a transient error after the deadline and sets Progressing=True/Retrying, this block will immediately overwrite it back to ProgressDeadlineExceeded, making Retrying status effectively invisible after the deadline. If the intent is to keep ProgressDeadlineExceeded sticky only for RollingOut (while still allowing Retrying/Succeeded to surface), consider applying the deadline check only when the Progressing reason is RollingOut (or when the revision engine reports InTransition), and skip overriding when the reason is Retrying/Archived/etc.
// Progress deadline enforcement only applies to Active revisions, not Archived ones.
// Archived revisions may need to retry teardown operations and should not have their
// Progressing condition or requeue behavior overridden by the deadline check.
if pd := existingRev.Spec.ProgressDeadlineMinutes; pd > 0 && reconciledRev.Spec.LifecycleState != ocv1.ClusterObjectSetLifecycleStateArchived {
cnd := meta.FindStatusCondition(reconciledRev.Status.Conditions, ocv1.ClusterObjectSetTypeProgressing)
isStillProgressing := cnd != nil && cnd.Status == metav1.ConditionTrue && cnd.Reason != ocv1.ReasonSucceeded
succeeded := meta.IsStatusConditionTrue(reconciledRev.Status.Conditions, ocv1.ClusterObjectSetTypeSucceeded)
// check if we reached the progress deadline only if the revision is still progressing and has not succeeded yet
if isStillProgressing && !succeeded {
timeout := time.Duration(pd) * time.Minute
if c.Clock.Since(existingRev.CreationTimestamp.Time) > timeout {
// progress deadline reached, reset any errors and stop reconciling this revision
markAsNotProgressing(reconciledRev, ocv1.ReasonProgressDeadlineExceeded, fmt.Sprintf("Revision has not rolled out for %d minute(s).", pd))
reconcileErr = nil
res = ctrl.Result{}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Calculate digest of the resolution inputs for change detection | ||
| digest, err := calculateResolutionDigest(ext.Spec.Source.Catalog) | ||
| if err != nil { | ||
| l.Error(err, "failed to calculate resolution digest, continuing without it") |
There was a problem hiding this comment.
If we can continue, perhaps this should be just info?
How it works on main
The COS reconciler checks if a revision took too long to roll out. If it did, it sets
Progressing=False/ProgressDeadlineExceeded.To avoid a reconcile loop after that,
mainhas a watch predicate that blocks allupdates to COS objects with ProgressDeadlineExceeded. The predicate only allows
deletions through.
What is the problem?
The predicate blocks too much. It blocks all updates, not just status updates.
Example: a new revision rolls out and tries to archive the old stuck one by patching
lifecycleState: Archived. The predicate sees ProgressDeadlineExceeded and drops theevent. The old revision never gets archived.
Example scenario
ProgressDeadlineMinutes, the reconciler setsProgressing=False/ProgressDeadlineExceeded.lifecycleState: Archivedso the old revision gets cleaned up.
COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.
Why does the reconcile loop happen?
Every time the reconciler runs on a deadline-exceeded COS, it calls
markAsProgressing()which sets
Progressing=True. Then the deadline check immediately sets it back toProgressing=False/ProgressDeadlineExceeded. This back-and-forth produces a statusupdate, which triggers another reconcile, and so on.
The predicate was added to break this loop. But it also breaks archiving.
What is the fix?
1. Make ProgressDeadlineExceeded sticky (prevents reconcile loop)
Once ProgressDeadlineExceeded is set, markAsProgressing() will not overwrite
it with RollingOut for Active revisions. This prevents the status flip-flop
that caused the reconcile loop.
The guard only blocks RollingOut. Other reasons (Succeeded, Retrying) can
still update the condition because they represent meaningful state changes.
With no flip-flop, there are no unnecessary status updates, so no reconcile
loop. The watch predicate is removed.
2. Skip progress deadline enforcement for Archived revisions
The deadline check now skips Archived revisions:
This allows Archived revisions to retry teardown operations and update their
status without being overridden by the deadline check.
3. Detect spec changes during rollout (resolution digest)
A SHA256 digest of all catalog filter inputs (packageName, version, channels,
selector, upgradeConstraintPolicy) is stored in the revision's
olm.operatorframework.io/resolution-digest annotation.
Before reusing a rolling-out revision, the controller compares the current
spec's digest with the revision's digest. If they differ, a new revision is
created.
This ensures spec changes are not ignored when a revision has
ProgressDeadlineExceeded.
How it works
ProgressDeadlineExceeded
Motivation
Addresses feedback from:
openshift/operator-framework-operator-controller#687 (comment)